
docs(README): remove degenerate DFlash perf row from #85 perf table #88

Merged

solderzzc merged 1 commit into SharpAI:main from ericjlake:fix/readme-dflash-row-cleanup on Apr 26, 2026

Conversation

@ericjlake (Contributor)

Follow-up to #85, which merged with a Qwen3-A3B perf table that included a `--dflash` row showing 70 tok/s on medium/long prompts. Subsequent benchmarking found that headline number was always produced by degenerate output ("and and and...", "**UMA** **UMA**...", etc.), with the longest run of identical tokens reaching 488 in a row.

Root cause

`DFlashRuntime.greedyTokensWithMask` uses argMax (pure greedy) for both draft and verify, regardless of the request's temperature. Vanilla SwiftLM samples stochastically at temp=0.6, which breaks ties between high-probability tokens; DFlash's pure greedy decoding has no tie-breaker and locks into low-entropy attractors. Once locked, draft and target both keep predicting the same connective ("and", "UMA", etc.), all 16 positions of every verify pass commit, and the loop self-reinforces: high acceptance and high throughput, but unusable output.
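For intuition, the contrast looks roughly like this (a minimal Swift sketch; these helpers are illustrative, not the actual SwiftLM/DFlash API):

```swift
import Foundation

// Illustrative sketch only; not the actual SwiftLM/DFlash API.

// What DFlash effectively does today: pure greedy. argMax is deterministic,
// so once an attractor token has the top logit, it wins every step.
func greedyToken(logits: [Float]) -> Int {
    logits.indices.max(by: { logits[$0] < logits[$1] })!
}

// What vanilla decoding does at temp=0.6: sample from the softmax.
// The randomness breaks ties between high-probability tokens and
// occasionally escapes an attractor before a loop can lock in.
func sampledToken(logits: [Float], temperature: Float = 0.6) -> Int {
    let scaled = logits.map { $0 / temperature }
    let maxLogit = scaled.max()!
    let exps = scaled.map { expf($0 - maxLogit) }   // numerically stable softmax
    var r = Float.random(in: 0..<exps.reduce(0, +))
    for (i, e) in exps.enumerated() {
        r -= e
        if r < 0 { return i }
    }
    return exps.count - 1
}
```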

Why we didn't catch it earlier

  • The server-side `[SwiftLM] DFlash summary: ... 70.3 tok/s` log line reports throughput, not quality.
  • High acceptance is consistent with degenerate output: target and draft both lock onto the same predictable token.
  • The snippets visible at the tail of summary log lines ("11", "Let's") were the last few tokens of repetitive runs, not clean prose.

Vanilla generation (no DFlash) on the same 5 prompts: clean output, 60.4 tok/s avg, uniqueness ratios 0.60–0.84.
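The uniqueness-ratio and max-run checks referenced here (and in the test plan below) are simple token-level heuristics. A minimal sketch; the thresholds are illustrative choices, not values from the codebase:

```swift
// Degeneracy heuristics for vetting benchmark output.
// Thresholds below are illustrative, not from the codebase.

func uniquenessRatio(_ tokens: [String]) -> Double {
    guard !tokens.isEmpty else { return 0 }
    return Double(Set(tokens).count) / Double(tokens.count)
}

func longestIdenticalRun(_ tokens: [String]) -> Int {
    var best = 0, current = 0
    var previous: String? = nil
    for token in tokens {
        current = (token == previous) ? current + 1 : 1
        best = max(best, current)
        previous = token
    }
    return best
}

// Clean vanilla runs scored 0.60-0.84 uniqueness; the degenerate DFlash
// runs hit identical-token runs up to 488 long, so even loose thresholds
// separate the two cleanly.
func looksDegenerate(_ tokens: [String]) -> Bool {
    uniquenessRatio(tokens) < 0.3 || longestIdenticalRun(tokens) > 32
}
```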

Mitigation attempts

We added a standard repetition penalty (mirroring `MLXLMCommon.RepetitionContext`) inside `DFlashRuntime.greedyTokensWithMask`, with a 64-token ring buffer. Results across 5 diverse prompts:

| Penalty   | Clean outputs | Best clean t/s | Notes |
| --------- | ------------- | -------------- | ----- |
| 1.0 (off) | 0/5           | n/a            | all degenerate |
| 1.1       | 1/5           | 37             | most still degenerate; longest run down from 488, but still 244 on some prompts |
| 1.3       | 1/5           | 15             | fixes more attractors, but acceptance crashes from 80% to 18-46% and throughput drops below vanilla |

Rep penalty is the wrong tool. At 1.1 it is too weak to dislodge attractors (it demotes the logit by only ~9%, while the attractor gap is often 10+ logit points). At 1.3 it is strong enough to break loops, but it also makes the target reject draft picks whenever the draft was greedy on a token the target wants to slightly demote; DFlash's strict `==` accept check then commits only up to the first mismatch, killing the speedup.
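For reference, the mitigation was the standard divide-positive/multiply-negative penalty over a sliding window of recent tokens. A simplified sketch (illustrative, not the actual `DFlashRuntime` code):

```swift
// Simplified sketch of the repetition-penalty mitigation.
// Illustrative only; not the actual DFlashRuntime implementation.
struct RepetitionPenalty {
    let penalty: Float          // 1.0 = off; 1.1 and 1.3 tested above
    let windowSize = 64         // ring buffer of recent token ids
    private var recent: [Int] = []

    mutating func record(_ token: Int) {
        recent.append(token)
        if recent.count > windowSize { recent.removeFirst() }
    }

    // Token ids are assumed to index directly into the logits array.
    func apply(to logits: inout [Float]) {
        for token in Set(recent) {
            // Standard penalty: divide positive logits, multiply negative
            // ones, so a recent token is always demoted. At penalty = 1.1
            // a positive logit shrinks by only ~9% (1/1.1), which is why
            // it cannot close a 10+ point attractor gap.
            let l = logits[token]
            logits[token] = l > 0 ? l / penalty : l * penalty
        }
    }
}
```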

The proper fix is in DFlash itself

This is the same root cause as the 122B SSD-stream finding tracked at z-lab/dflash#91 (acceptanceLen=0|1 → I/O fan-out kills throughput). Both reduce to: DFlash's argmax-greedy verify path can't tolerate sampler-controlled diversity on the target side.

The proper fix is stochastic posterior sampling with rejection-based accept (the Leviathan/Chen formulation): the target samples from the softmax at temperature T, and a draft-proposed token `d` is accepted iff `r ~ U(0,1) < min(1, p_target(d) / p_draft(d))`. This preserves the target distribution and converts the rigid `==` accept into a probabilistic check that doesn't fall off a cliff on small disagreements. That's a DFlash architecture change, tracked upstream.
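Concretely, the accept step in that formulation looks like the sketch below (illustrative, not DFlash code; on rejection, the full algorithm also resamples from the normalized residual `max(0, p_target - p_draft)` to preserve the target distribution):

```swift
// Sketch of the Leviathan/Chen rejection-based accept (not DFlash code).
// pTarget and pDraft are the two models' softmax distributions at this
// position; d is the token the draft proposed.
func acceptsDraftToken(_ d: Int, pTarget: [Float], pDraft: [Float]) -> Bool {
    // Accept with probability min(1, p_target(d) / p_draft(d)).
    // A token the target merely demotes a little now loses a little
    // accept probability, instead of failing a strict == comparison.
    let ratio = pTarget[d] / max(pDraft[d], 1e-9)
    return Float.random(in: 0..<1) < min(1, ratio)
}
```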

This PR

Replaces the misleading `--dflash` perf row with a clear warning, so users don't adopt a degenerate codepath as the recommended config. Vanilla's 60.4 tok/s remains the honest production number for now.

The `--dflash` flag itself stays in place (no code changes); the issue is the config recommendation, not the implementation. Once the upstream fix lands, we can re-add the row with verified-clean numbers.

Test plan

  • No code changes; README only.
  • Re-verified the same vanilla benchmark numbers (61.7 / 62.3 / 62.1 tok/s) and confirmed clean output via uniqueness-ratio and max-run checks.

References

Follow-up to SharpAI#85 (just merged), whose perf table introduced the degenerate `--dflash` row. The upstream fix (stochastic posterior sampling with rejection-based accept, Leviathan/Chen) is tracked at z-lab/dflash#91.
See z-lab/dflash#91 (issuecomment 4322584783) for the full diagnosis.
@solderzzc solderzzc merged commit b11e61e into SharpAI:main Apr 26, 2026
11 checks passed